Crosslingual and Multilingual Construction of Syntax-Based Vector Space Models

Authors

  • Jason Utt
  • Sebastian Padó
Abstract

Syntax-based distributional models of lexical semantics provide a flexible and linguistically adequate representation of co-occurrence information. However, their construction requires large, accurately parsed corpora, which are unavailable for most languages. In this paper, we develop a number of methods to overcome this obstacle. We describe (a) a crosslingual approach that constructs a syntax-based model for a new language requiring only an English resource and a translation lexicon; and (b) multilingual approaches that combine crosslingual with monolingual information, subject to availability. We evaluate on two lexical semantic benchmarks in German and Croatian. We find that the models exhibit complementary profiles: crosslingual models yield higher accuracies while monolingual models provide better coverage. In addition, we show that simple multilingual models can successfully combine their strengths.
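The crosslingual idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy English syntax-based space keyed by (relation, context word) pairs, a one-to-one translation lexicon, and a simple backoff rule as one possible way to combine crosslingual and monolingual information. All names and data (`EN_VECTORS`, `LEXICON`, `DE_VECTORS`, `backoff_similarity`) are hypothetical.

```python
import math

# Hypothetical English syntax-based vectors: (relation, context word) -> weight.
EN_VECTORS = {
    "dog": {("subj_of", "bark"): 3.0, ("obj_of", "feed"): 2.0},
    "cat": {("subj_of", "meow"): 3.0, ("obj_of", "feed"): 2.0},
}

# Hypothetical German -> English translation lexicon (one translation per word).
LEXICON = {"Hund": "dog", "Katze": "cat"}

# Hypothetical monolingual German vectors with plain word contexts.
DE_VECTORS = {
    "Hund": {"bellt": 2.0, "füttern": 1.0},
    "Katze": {"miaut": 2.0, "füttern": 1.0},
    "Kater": {"miaut": 2.0, "füttern": 1.0},
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def crosslingual_similarity(w1, w2):
    """Translate both words into English and compare them in the English space.

    Returns None when either word has no translation or no English vector.
    """
    e1, e2 = LEXICON.get(w1), LEXICON.get(w2)
    if e1 not in EN_VECTORS or e2 not in EN_VECTORS:
        return None
    return cosine(EN_VECTORS[e1], EN_VECTORS[e2])


def monolingual_similarity(w1, w2):
    """Compare two words directly in the target-language space."""
    if w1 not in DE_VECTORS or w2 not in DE_VECTORS:
        return None
    return cosine(DE_VECTORS[w1], DE_VECTORS[w2])


def backoff_similarity(w1, w2):
    """Assumed combination rule: prefer crosslingual, fall back to monolingual."""
    sim = crosslingual_similarity(w1, w2)
    return sim if sim is not None else monolingual_similarity(w1, w2)
```

In this toy setup, `backoff_similarity("Hund", "Katze")` is answered by the crosslingual model, while `backoff_similarity("Hund", "Kater")` falls back to the monolingual vectors because "Kater" is missing from the lexicon. This mirrors the trade-off the abstract describes between the crosslingual and monolingual models, though the combination strategies actually evaluated in the paper may differ.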

Similar resources

Multilingual Training of Crosslingual Word Embeddings

Crosslingual word embeddings represent lexical items from different languages using the same vector space, enabling crosslingual transfer. Most prior work constructs embeddings for a pair of languages, with English on one side. We investigate methods for building high quality crosslingual word embeddings for many languages in a unified vector space. In this way, we can exploit and combine infor...

Crosslingual speech recognition with multilingual acoustic models based on agglomerative and tree-based triphone clustering

The paper describes our ongoing work on crosslingual speech recognition based on multilingual triphone hidden Markov models. Multilingual acoustic models were built using two different clustering procedures: agglomerative triphone clustering and tree-based triphone clustering. The agglomerative clustering procedure is based on measuring the similarity of triphones on a phoneme level where the m...

Multilingual and Crosslingual Speech Recognition

This paper describes the design of a multilingual speech recognizer using an LVCSR dictation database which has been collected under the project GlobalPhone. This project at the University of Karlsruhe investigates LVCSR systems in 15 languages of the world, namely Arabic, Chinese, Croatian, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, and Tu...

Using Parallel Corpora to enrich Multilingual Lexical Resources

This paper describes the use of a bilingual vector model for the automatic discovery of German translations of English terms. The model is built by analysing co-occurrence patterns in a parallel corpus of English and German medical abstracts, a method also used for CrossLingual Information Retrieval. The model generates candidate German translations of English words using the cosine similarity m...

CroVeWA: Crosslingual Vector-Based Writing Assistance

We present an interactive web-based writing assistance system that is based on recent advances in crosslingual compositional distributed semantics. Given queries in Japanese or English, our system can retrieve semantically related sentences from high quality English corpora. By employing crosslingually constrained vector space models to represent phrases, our system naturally sidesteps several ...

Journal:
  • TACL

Volume 2

Publication date: 2014